This is the first in a series of notebooks documenting my learning process while completing a project in Python. The project is based on US Census data provided by MuonNeutrino on Kaggle (kaggle.com/muonneutrino/us-census-demographic-data). In this first notebook I visualise the county data on an interactive map. The second notebook in the project analyses the data using machine learning techniques.
#Initialisation Cell
#Import Modules
from urllib.request import urlopen
import json
import plotly.express as px
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
#Get data from github
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
url='https://raw.githubusercontent.com/SamCouch/USCensusAnalysis/master/data/acs2015_county_data.csv'
#Read the csv, skipping any malformed lines (on_bad_lines requires pandas 1.3 or later)
f = pd.read_csv(url, on_bad_lines='skip')
To visualise the data I needed some modules, plus county boundary data so that values could be plotted on a map by county ID. For this I used a GeoJSON file I found on the Plotly website (plotly.com/python/mapbox-county-choropleth/), which is loaded in the initialisation cell.
#Set focus value
foval='Income'
#Create figure
fig = px.choropleth(f, geojson=counties, locations='CensusId', color=foval,
                    color_continuous_scale="Viridis",
                    scope="usa",
                    )
#Show figure
fig.show()
And, well, that's not right. Not all the states are there.
Looking at the data, I noticed that the missing states are the ones at the start of the list, whose Census IDs should start with 0. The leading 0 is needed to match the county IDs used by the map, but it was not included in the dataset.
At first I tried editing the CSV file itself in Excel, but this didn't help, as pandas still stored the values without the zeros. Excel also had a tendency to mess with the file, changing some of the characters.
Instead, the values were edited in the Python code using pandas directly.
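To show the padding step in isolation, here is a minimal sketch on a toy DataFrame (the three IDs are made-up illustration values, not rows taken from the census file). It uses pandas' vectorised `str.zfill`, which is an alternative to the `apply`/`format` approach used in this notebook and should give the same result for integer IDs.

```python
import pandas as pd

# Toy data (assumed values): IDs read as integers, so leading zeros are lost.
toy = pd.DataFrame({"CensusId": [1001, 6037, 48201]})

# Convert to string, then left-pad with zeros to a width of 5 characters.
toy["CensusId"] = toy["CensusId"].astype(str).str.zfill(5)

print(toy["CensusId"].tolist())  # ['01001', '06037', '48201']
```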
#Format Census ID to have 5 digits
f['CensusId']=f['CensusId'].apply(lambda x: '{0:0>5}'.format(x))
#Set focus value
foval='Income'
#Create figure
fig = px.choropleth(f, geojson=counties, locations='CensusId', color=foval,
                    color_continuous_scale="Viridis",
                    scope="usa",
                    labels={'County':foval},
                    hover_name="County",
                    hover_data=["State","CensusId",foval]
                    )
#Show figure
fig.show()
That's looking better. However, I couldn't help but notice that there were still some grey spots on the map. To make these easier to investigate, I also edited the hover menu to give more information on each county highlighted.
Upon investigation, the grey spots in Utah and Florida could be explained by lakes, but the ones in South Dakota, Texas, Virginia and some other states couldn't be explained so easily. From what I can tell, they seem to be counties that don't manage themselves and are unorganised, so were likely excluded from the census data I used.
On further examination there was another problem: a lot of the data was given as raw values rather than percentages. This means that maps of certain values tended to be little more than population maps, rather than maps showing meaningful statistics.
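One way to list the grey counties rather than hunting for them by hovering is to compare the IDs in the map data with the IDs in the table. The sketch below assumes each GeoJSON feature carries its county code in an `id` field, as the Plotly county file appears to; the helper name `unmatched_feature_ids` and the toy values are hypothetical, introduced only for illustration.

```python
import pandas as pd

def unmatched_feature_ids(geojson, ids):
    """Return the GeoJSON feature ids that have no matching row in the data."""
    feature_ids = {feature["id"] for feature in geojson["features"]}
    return sorted(feature_ids - set(ids))

# Toy stand-ins (assumed values) for `counties` and `f['CensusId']`.
toy_geojson = {"features": [{"id": "01001"}, {"id": "01003"}, {"id": "01005"}]}
toy_ids = pd.Series(["01001", "01005"])

print(unmatched_feature_ids(toy_geojson, toy_ids))  # ['01003']
```

With the objects from this notebook, the call would be `unmatched_feature_ids(counties, f['CensusId'])`, run after the Census IDs have been padded to five digits.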
#Format Census ID to have 5 digits
f['CensusId']=f['CensusId'].apply(lambda x: '{0:0>5}'.format(x))
#Set focus values
foval1='TotalPop'
foval2='Men'
#Create figures
fig1 = px.choropleth(f, geojson=counties, locations='CensusId', color=foval1,
                     color_continuous_scale="Viridis",
                     scope="usa",
                     labels={'County':foval1},
                     hover_name="County",
                     hover_data=["State","CensusId",foval1]
                     )
fig2 = px.choropleth(f, geojson=counties, locations='CensusId', color=foval2,
                     color_continuous_scale="Viridis",
                     scope="usa",
                     labels={'County':foval2},
                     hover_name="County",
                     hover_data=["State","CensusId",foval2]
                     )
#Show figures
fig1.show()
fig2.show()
This is not really the kind of statistic that is useful for either visualisation or data analysis. When we want to know about a quantity like the number of men in a county, we are likely far more interested in the percentage. To rectify this, we can divide the columns holding raw values by the total population.
#Format Census ID to have 5 digits
f['CensusId']=f['CensusId'].apply(lambda x: '{0:0>5}'.format(x))
#Editing the data to transform into percentages
f['Men']=round((f['Men']/f['TotalPop'])*100,2)
f['Women']=round((f['Women']/f['TotalPop'])*100,2)
f['Citizen']=round((f['Citizen']/f['TotalPop'])*100,2)
f['Employed']=round((f['Employed']/f['TotalPop'])*100,2)
#Set focus value
foval='Men'
#Create figure
fig = px.choropleth(f, geojson=counties, locations='CensusId', color=foval,
                    color_continuous_scale="Viridis",
                    scope="usa",
                    labels={'County':foval},
                    hover_name="County",
                    hover_data=["State","CensusId",foval]
                    )
#Show figure
fig.show()
Now we are given a far more meaningful map, showing where there are higher proportions of men, rather than just the population of each county.
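One thing to watch with converting columns in place is that re-running the cell divides values that are already percentages a second time. A minimal sketch of one way around this, on toy data with made-up numbers, is to write the percentages to new columns instead; the `Pct` suffix is a hypothetical naming choice, not something from the dataset.

```python
import pandas as pd

# Toy data (made-up numbers) using the same column names as the census file.
toy = pd.DataFrame({
    "TotalPop": [1000, 250],
    "Men": [480, 130],
    "Women": [520, 120],
})

# Store percentages in new columns so the raw counts stay untouched
# and re-running this cell always gives the same result.
for col in ["Men", "Women"]:
    toy[col + "Pct"] = (toy[col] / toy["TotalPop"] * 100).round(2)

print(toy[["MenPct", "WomenPct"]])
```

The new column can then be passed as the focus value for the map in the same way as any other column.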
There are other styles of map too. The following shows the whole world, which is useful for this dataset in order to see Puerto Rico.
#Format Census ID to have 5 digits
f['CensusId']=f['CensusId'].apply(lambda x: '{0:0>5}'.format(x))
#Set focus value
foval='Poverty'
#Create figure
fig = px.choropleth_mapbox(f, geojson=counties, locations='CensusId', color=foval,
                           color_continuous_scale="Viridis",
                           labels={'County':foval},
                           hover_name="County",
                           hover_data=["State","CensusId",foval],
                           mapbox_style="carto-positron",
                           zoom=2, center={"lat": 45, "lon": -110},
                           opacity=0.5,
                           )
#Show figure
fig.show()
With the visualisation of the dataset done, I next wanted to focus on analysing the census data using machine learning methods.